ML 08L - Hyperopt Lab (Python)

Hyperopt Lab

The Hyperopt library allows for parallel hyperparameter tuning using either random search or Tree of Parzen Estimators (TPE). With MLflow, we can record the hyperparameters and corresponding metrics for each hyperparameter combination. You can read more about using SparkTrials with Hyperopt in the Hyperopt documentation.

In this lesson you:

  • Learn how to distribute tuning tasks for a single-node machine learning model by using the SparkTrials class rather than the default Trials class.

SparkTrials fits and evaluates each model on one Spark executor, allowing massive scale-out for tuning. To use SparkTrials with Hyperopt, simply pass the SparkTrials object to Hyperopt's fmin() function.
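For orientation, here is a minimal sketch of an fmin call on a toy objective, showing both of Hyperopt's built-in search algorithms. The toy function and bounds are illustrative only (and the package itself is installed in the next cell); the rest of this lab builds a real objective function and search space step by step.

from hyperopt import fmin, hp, rand, tpe

# Toy search space: a single uniform hyperparameter x in [0, 10]
toy_space = hp.uniform("x", 0, 10)

# Random search
best_rand = fmin(fn=lambda x: (x - 3) ** 2, space=toy_space, algo=rand.suggest, max_evals=20)

# Tree of Parzen Estimators (adaptive search)
best_tpe = fmin(fn=lambda x: (x - 3) ** 2, space=toy_space, algo=tpe.suggest, max_evals=20)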

%pip install hyperopt
Python interpreter will be restarted.
Successfully installed cloudpickle-1.6.0 future-0.18.2 hyperopt-0.2.5 networkx-2.5 tqdm-4.52.0
Python interpreter will be restarted.
%run "../Includes/Classroom-Setup"

Read in a cleaned version of the Airbnb dataset with just numeric features.

from sklearn.model_selection import train_test_split
import pandas as pd
 
# Copy the dataset from DBFS to the driver's local filesystem so pandas can read it
dbutils.fs.cp('dbfs:/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv', 'file:/tmp/airbnb-cleaned-mlflow.csv')
df = pd.read_csv('file:/tmp/airbnb-cleaned-mlflow.csv').drop(["zipcode"], axis=1)
 
# split 80/20 train-test
X_train, X_test, y_train, y_test = train_test_split(df.drop(["price"], axis=1),
                                                    df[["price"]].values.ravel(),
                                                    test_size = 0.2,
                                                    random_state = 42)

Now we need to define an objective_function that evaluates the random forest's predictions using R2.

In the code below, compute the R2 and return it along with STATUS_OK. Remember that Hyperopt minimizes the loss, so since we are trying to maximize R2, we need to return it as a negative value.

# TODO
from sklearn.ensemble import RandomForestRegressor
from hyperopt import STATUS_OK
  
def objective_function(params):
 
  # set the hyperparameters that we want to tune:
  max_depth = params["max_depth"]
  max_features = params["max_features"]
 
  regressor = RandomForestRegressor(max_depth=max_depth, max_features=max_features, random_state=42)
  regressor.fit(X_train, y_train)
 
  # Evaluate predictions
  r2 = regressor.score(X_test, y_test)
 
  # Note: since we aim to maximize r2, we need to return it as a negative value ("loss": -metric)
  return {"loss": -r2, "status": STATUS_OK}

We need to define a search space for Hyperopt. Let max_depth vary from 2 to 10, and let max_features be one of "auto", "sqrt", or "log2".

# TODO
from hyperopt import hp
 
search_space = {
  "max_depth": hp.randint("max_depth", 2, 10),
  "max_features": hp.choice("max_features", ["auto", "sqrt", "log2"])
}
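To see what a single draw from this space looks like, you can sample from it directly with Hyperopt's stochastic sampler:

import hyperopt.pyll.stochastic

# Draw one random parameter combination from the search space,
# e.g. {'max_depth': 7, 'max_features': 'sqrt'}
print(hyperopt.pyll.stochastic.sample(search_space))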

Instead of using the default Trials class, you can use the SparkTrials class to distribute tuning tasks across Spark executors. On Databricks, SparkTrials runs are automatically logged to MLflow.

SparkTrials takes three optional arguments: parallelism, timeout, and spark_session. You can read more about them in the Hyperopt documentation.
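For example, a SparkTrials object that evaluates at most 4 trials concurrently and stops tuning after 10 minutes could be constructed like this (the values are illustrative):

from hyperopt import SparkTrials

# parallelism: maximum number of trials evaluated concurrently
# timeout: hard cap, in seconds, on the total tuning time
illustrative_trials = SparkTrials(parallelism=4, timeout=600)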

In the code below, fill in the fmin function.

# TODO
from hyperopt import fmin, tpe, STATUS_OK, SparkTrials
 
# the number of models we want to evaluate
num_evals = 8
# set the number of models to be trained concurrently
spark_trials = SparkTrials(parallelism=2)
best_hyperparam = fmin(fn = objective_function, 
                       space = search_space,
                       algo = tpe.suggest, 
                       trials = spark_trials,
                       max_evals = num_evals)
 
best_hyperparam
100%|██████████| 8/8 [00:16<00:00, 2.01s/trial, best loss: -0.6752642381859784]
Total Trials: 8: 8 succeeded, 0 failed, 0 cancelled.
Out[9]: {'max_depth': 8, 'max_features': 0}
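Note that best_hyperparam reports max_features as an index (0) because hp.choice returns the position of the selected option, not its value. Hyperopt's space_eval utility maps the result back to the actual values:

from hyperopt import space_eval

# Resolve index-based results (here 'max_features': 0 -> 'auto')
space_eval(search_space, best_hyperparam)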

Now you can compare all of the models using the MLflow UI.

To understand the effect of tuning a hyperparameter:

  1. Select the resulting runs and click Compare.
  2. In the Scatter Plot, select a hyperparameter for the X-axis and loss for the Y-axis.
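You can also retrieve the same runs programmatically; here is a minimal sketch using the MLflow client, assuming the runs were logged to the notebook's active experiment and that the logged metric is named loss:

import mlflow

# Pull the logged runs into a pandas DataFrame, best (most negative) loss first
runs = mlflow.search_runs(order_by=["metrics.loss ASC"])
display(runs)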